Abstract

This  paper  provides  a  theorem  which  illustrates  why  a  general  adaptive  feed-forward 
layered network with linear output units can perform well as a pattern classification
device.  The  central  result  is  that  minimising  the  error  at  the  output  of  the  network 
is  equivalent  to  maximising  a  particular  norm,  the  Network  Cost  Function,  at  the 
output  of  the  hidden  units.  If  the  total  covariance  matrix  is  full  rank  and  the  targets 
are  appropriately  chosen,  then  this  cost  function  relates  the  inverse  of  the  total 
covariance  matrix  and  the  weighted  between  class  covariance  matrix  of  the  hidden 
unit  patterns.  In  a  linear  network  it  is  shown  how  our  theorem  can  reproduce  the 
result recently obtained by Gallinari et al. as a special case. We present numerical
simulations  to  illustrate  the  theorem  and  to  show  that  alternative  choices  for  the  cost 
function  at  the  hidden  layer  are  not  maximised,  generally,  in  a  nonlinear  situation. 
1 Introduction.

Adaptive feed-forward layered networks as exemplified by the Multi-layer Perceptron are
known to be particularly useful as pattern classification techniques (see for instance [1,2]).
What is not understood is why they perform good classification, and what underlying
mechanism  is  responsible. 

In certain instances, it is possible to identify the action of a layered network structure
with the operation of a conventional classification scheme. For instance, a linear Perceptron
performing an auto-associative task is equivalent to a Principal Component analysis of the
data [3]. This result was generalised by Gallinari et al. [4] to a linear perceptron performing
a hetero-associative function. This work was important because it made explicit the
fact that a linear network performing a one-from-N classification to minimise the total
mean-square output error did so by implicitly maximising the ratio of the determinants of
the between class and total covariance matrices. If a (linear) transformation of the input
data can be made which produces an adequate separation of the classes as determined by
the between class and total covariance matrices, then a subsequent linear sectioning procedure
should produce good classification results. Thus it is clear why the linear Multi-layer
Perceptron is capable of performing well in such circumstances. However, for a more interesting
nonlinear transformation from the input data to the (usually, though not necessarily,
dimension-reducing) space spanned by the hidden units, much less is known outside empirical
observation. This nonlinear transformation may be the usual logistic transformation of
the scalar products between input vectors and weight vectors as in the traditional multi-layer
perceptron. Alternatively, it may be the nonlinear transformation of the norm of the
vector difference between input data and weight vectors, as in the Radial Basis Function
network [5].

This  study  reports  theoretical  and  numerical  results  on  a  subclass  of  general  layered 
nonlinear  feed-forward  adaptive  networks  which  demonstrate  why  such  networks  have  the 
ability  to  perform  nonlinear  discriminant  analysis  successfully. 

The object of interest in this paper is that class of layered feed-forward network which
takes input data, performs an arbitrary nonlinear transformation to a space controlled by
‘hidden’  units,  and  finally  performs  a  linear  transformation  which  attempts  to  minimise  the 
mean-square  error  to  a  set  of  known  output  targets.  This  network  class  will  be  made  more 
explicit  in  the  next  section.  By  a  theoretical  study  of  this  structure,  it  will  be  apparent  that 
a  good  discrimination  between  classes  in  the  space  of  the  hidden  units  is  obtained  implicitly, 
by  requiring  a  minimisation  of  the  output  error.  However,  the  cost  function  maximised  at 
the  outputs  of  the  hidden  layer  is  not  a  common  choice  in  conventional  linear  or  nonlinear 
discriminant analysis (see for instance [4,6]). It will be shown that, for the linear layered
network,  maximisation  of  the  proposed  cost  function  is  equivalent  to  the  maximisation  of 
more  popular  cost  functions  associated  with  discriminant  analysis. 

2  Discussion  of  the  Network. 

This  section  discusses  the  general  class  of  adaptive  feed-forward  networks,  and  the  subclass 
of  networks  relevant  to  the  theorem  proposed  in  section  3. 

In conventional feed-forward layered networks, data in the form of patterns represented
as $n$ dimensional (real valued) vectors are mapped by a nonlinear transformation on to $n'$
dimensional target vectors in the following fashion. The input patterns are presented to
a set of $n$ input units. Each input unit is totally connected to a set of $n_0$ 'hidden' units
(hidden from direct contact with the environment). Associated with each link between the
i-th unit in the input layer and the j-th unit in the hidden layer is a scalar weight value,
$\mu_{ij}$. Usually, the fan-in to a hidden node takes the form of a hyperplane, and the input
to node j is of the form $\theta_j = \sum_{i=1}^{n} I_i^p \mu_{ij}$ where $I_i^p$ is the i-th component of the p-th input
pattern vector. In the case of the radial basis function network [5], this fan-in takes the
form of a hypersphere, i.e. $\theta_j = \sum_{i=1}^{n} (I_i^p - \mu_{ij})^2$ for a Euclidean vector norm. The role
of the hidden unit is to accept the fan-in and to pass it through a (generally) nonlinear
transfer function

$$\phi_j^p = \phi_j(\mu_{0j} + \theta_j) \qquad (1)$$

where $\mu_{0j}$ is a local 'bias' associated with each hidden unit.
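The two kinds of fan-in described above can be sketched in a few lines of NumPy. This is a minimal illustration, not the authors' implementation: the function names, the array layout (patterns stored as rows) and the choice of a logistic transfer function are assumptions made here.

```python
import numpy as np

def hyperplane_fan_in(patterns, weights):
    """theta[p, j] = sum_i I[p, i] * mu[i, j] (scalar-product fan-in)."""
    return patterns @ weights

def hypersphere_fan_in(patterns, centres):
    """theta[p, j] = sum_i (I[p, i] - mu[i, j])**2 (squared Euclidean distance)."""
    diff = patterns[:, :, None] - centres[None, :, :]   # shape (P, n, n_0)
    return np.sum(diff ** 2, axis=1)

def logistic(x):
    """Logistic transfer function, assumed here as the hidden nonlinearity."""
    return 1.0 / (1.0 + np.exp(-x))

def hidden_outputs(theta, bias, transfer=logistic):
    """Equation (1): pass bias plus fan-in through the transfer function."""
    return transfer(bias + theta)

rng = np.random.default_rng(0)
I = rng.normal(size=(6, 3))     # P = 6 patterns, n = 3 inputs (synthetic data)
mu = rng.normal(size=(3, 4))    # n x n_0 weights, n_0 = 4 hidden units
mu0 = rng.normal(size=4)        # one bias per hidden unit
phi = hidden_outputs(hyperplane_fan_in(I, mu), mu0)
```

Swapping `hyperplane_fan_in` for `hypersphere_fan_in` changes only the fan-in; the rest of the pipeline is unchanged, which is why both network types fall within the class considered here.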

The hidden layer units are fully connected to a set of $n'$ output units corresponding to
the components of the $n'$ dimensional vector in the output 'target' space. The strength of
the link between the j-th hidden unit and the k-th output unit is $\lambda_{jk}$ and thus the value
received at the k-th output unit is $\theta'_k = \sum_{j=1}^{n_0} \lambda_{jk} \phi_j^p$. In general, the output from the k-th
output unit is a nonlinear function of its input,

$$O_k^p = \Phi_k(\lambda_{0k} + \theta'_k) \qquad (2)$$

where $\lambda_{0k}$ is the bias associated with the k-th output unit.

The  effect  of  this  class  of  networks  is  to  produce  an  interpolation  surface  [5,2]  in  the 
high dimensional space $R^n \otimes R^{n'}$ which is entirely determined once a suitable set of values
for the parameters $\{\lambda, \mu\}$ has been specified. The ability of the network subsequently to
generalise  depends  on  the  shape  of  this  interpolating  surface;  if  the  network  is  sufficiently 
complex,  it  will  be  possible  to  find  a  set  of  parameters  which  produces  an  interpolating 
surface  which  passes  exactly  through  the  set  of  training  patterns.  This  will  not  be  a  good 
generalisation  strategy  as  the  noise  in  the  data  will  also  have  been  fitted.  Alternatively,  if 
the  network  geometry  is  not  complex  enough,  there  will  not  be  a  choice  of  parameter  values 
which  will  allow  the  interpolating  surface  to  represent  the  relationships  in  the  training  data. 
This  paper  is  not  concerned  with  generalisation,  but  with  training,  and  what  it  means  to 
have  a  suitable  set  of  parameters  conditional  upon  the  training  data. 

The parameters of the network are determined conditional upon a set of $P$ input training
patterns, $\{|I\rangle^p \in R^n\}$ ¹ and their associated targets $\{|T\rangle^p \in R^{n'}\}$, $p = 1, 2, \ldots, P$. Conventionally,
the parameters are chosen so as to minimise the mean-square error between the
actual output of the network and the desired target patterns, i.e. the aim is to minimise

$$E = \sum_{p=1}^{P} \left\| |T\rangle^p - |O\rangle^p \right\|^2 = \sum_{p=1}^{P} \sum_{k=1}^{n'} \left( T_k^p - \Phi_k\left( \lambda_{0k} + \sum_{j=1}^{n_0} \lambda_{jk}\, \phi_j(\mu_{0j} + \theta_j) \right) \right)^2 \qquad (3)$$

¹ In the 'bra-ket' notation for vectors, a column vector $(x_1, x_2, \ldots)$ is written as $|x\rangle$ (the 'ket'). The
corresponding row vector is denoted $\langle x|$ (the 'bra' vector). A scalar product between $|x\rangle$ and $\langle y|$ is given
by $\langle y|x\rangle$ and $|y\rangle\langle x|$ is a linear operator with matrix elements $y_i x_j$.
where $T_k^p$ is the k-th component of the p-th target pattern vector. Since this is an explicit,
differentiable nonlinear function of the parameters of interest, one can use one of the many
nonlinear optimisation techniques to find an acceptable local minimum [2] which will give
a suitable set of weight values. Although many other error functions may be chosen to be
minimised  at  the  output  of  the  network,  we  will  see  that  this  particular  choice  has  merits 
for  discriminant  analysis. 


The  one  slight  variant  on  this  general  theme  that  the  rest  of  the  paper  requires,  is  that  the 
transfer functions on the output units are linear: $\Phi_k(z) = z \;\; \forall k$. The important consequence
of this restriction, analytically and numerically, is that the weights connecting the hidden
units to the output units may be analysed by linear optimisation methods. In particular,
given the set of weights connecting the input to hidden units, the hidden-to-output units
may be adjusted by a linear least-mean-squares method to produce a global minimum in
the error subspace spanned by the set of weights $\{\lambda_{jk}\}$. Consequently, given this latter set of
weights, the initial input-to-hidden weights may be adjusted by a nonlinear optimisation
strategy to find a better local minimum in the error subspace determined by the set of
weights $\{\mu_{ij}\}$. This procedure may continue iteratively. For every 'slow' adjustment of the
input-to-hidden  weights,  the  hidden-to-output  weights  respond  rapidly  always  maintaining 
the  global  error  minimum  in  that  subspace  -  the  output  weights  are  ‘slaved’  to  the  behavior 
of  the  input  weights  (for  a  numerical  comparison  between  this  hybrid  methodology  and 
solving  the  entire  set  of  weights  by  nonlinear  optimisation  see  [7]).  This  hybrid  method 
is  also  closely  related  to  the  solution  of  the  radial  basis  function  network,  if  the  radial 
basis  function  centres  (corresponding  to  the  knots  in  curve  fitting)  are  allowed  to  adjust 
themselves  by  nonlinear  methods. 
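The 'slaving' of the output weights can be illustrated with a small NumPy sketch. It assumes the hidden-unit outputs are already available as an array with one column per pattern; the function names and the synthetic data are hypothetical, and `np.linalg.lstsq` stands in for whatever linear least-squares routine is actually used.

```python
import numpy as np

def solve_output_weights(hidden, targets):
    """Least-squares output weights for fixed hidden outputs.

    hidden:  (n_0, P) hidden-unit outputs, one column per pattern.
    targets: (n', P) target vectors, one column per pattern.
    Returns an (n', n_0 + 1) matrix whose first column holds the biases.
    """
    P = hidden.shape[1]
    H = np.vstack([np.ones((1, P)), hidden])      # prepend the constant unit
    # lstsq solves min ||H.T @ X - targets.T||, where X is the transposed weight matrix
    X, *_ = np.linalg.lstsq(H.T, targets.T, rcond=None)
    return X.T

def output_error(weights, hidden, targets):
    """Sum-of-squares error at the (linear) output units."""
    P = hidden.shape[1]
    H = np.vstack([np.ones((1, P)), hidden])
    return float(np.sum((weights @ H - targets) ** 2))

rng = np.random.default_rng(1)
hidden = np.tanh(rng.normal(size=(3, 20)))        # synthetic hidden outputs
targets = rng.normal(size=(2, 20))                # synthetic targets
Lam = solve_output_weights(hidden, targets)
E_opt = output_error(Lam, hidden, targets)
E_perturbed = output_error(Lam + 0.01 * rng.normal(size=Lam.shape), hidden, targets)
```

In a hybrid scheme of the kind described, a routine like `solve_output_weights` would be called after every update of the input-to-hidden weights, so the nonlinear optimiser only ever sees the error as a function of the hidden-layer parameters.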


It is interesting that least-squares error minimisation and linear output units in a general
feed-forward layered network illuminate the success of such structures in discrimination
problems, as we now demonstrate.
3 Theoretical Analysis.

In this section, the error which is minimised at the output of the network will be analysed
to  reveal  the  cost  function  that  is  implicitly  maximised  by  the  network  at  the  output  of 
the  hidden  units.  This  will  be  accomplished  in  two  stages.  Firstly,  it  will  be  shown  that 
the  effect  of  the  biases  at  the  output  of  the  network  is  to  ensure  that  the  mean  output  of 
the  network  equals  the  mean  target  pattern.  This  result  allows  the  output  biases  to  be 
removed  by  a  rescaling  of  the  target  vectors  and  the  outputs  of  the  hidden  units.  Secondly, 
the  error  in  terms  of  the  rescaled  variables  is  analysed  to  obtain  the  actual  cost  function 
which  is  being  maximised.  Further  subsections  will  consider  the  interpretation  of  the  result 
in  certain  limiting  circumstances. 

The  error  at  the  output  of  the  network  introduced  in  the  previous  section  may  be 
expressed  as 


$$E = \|\Lambda H - T\|^2 = \mathrm{Tr}\left[ (\Lambda H - T)(H^* \Lambda^* - T^*) \right] \qquad (4)$$


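where $A^*$ indicates the transpose of matrix $A$ and Tr denotes the trace operation. The
matrix $\Lambda$ is an $n' \times (n_0 + 1)$ array of weight values, including the biases,

$$\Lambda = \begin{pmatrix} \lambda_{01} & \lambda_{11} & \cdots & \lambda_{n_0 1} \\ \vdots & \vdots & & \vdots \\ \lambda_{0n'} & \lambda_{1n'} & \cdots & \lambda_{n_0 n'} \end{pmatrix} \qquad (5)$$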

Matrix $T$ is an $n' \times P$ array of desired 'target' values, i.e. $P$ vectors each of length $n'$.
For the moment, the matrix elements of the target array will be denoted $t_{kp}$. Subsequently,
these values will be restricted to either zero, unity, or the reciprocal of the square root of
the number of elements in each class. Matrix $H$ is the $(n_0 + 1) \times P$ array of the $P$ output
vectors of the $n_0$ hidden units plus a unit with unity output to feed the bias weights. These
output  vectors  at  the  hidden  layer  are  treated  as  input  vectors  to  the  final  transformation 
without any a priori assumptions regarding their origins. Matrix $H$ may be expressed as

$$H = \begin{pmatrix} 1 & 1 & \cdots & 1 \\ \phi_1^1 & \phi_1^2 & \cdots & \phi_1^P \\ \vdots & \vdots & & \vdots \\ \phi_{n_0}^1 & \phi_{n_0}^2 & \cdots & \phi_{n_0}^P \end{pmatrix} \qquad (6)$$

where $\phi_j^p$, $j = 1, \ldots, n_0$, $p = 1, \ldots, P$ is the output value at the j-th hidden unit
corresponding to the p-th pattern.

Note that the matrix $(\Lambda H)$ may be expressed in the form

$$\Lambda H = |\lambda_0\rangle\langle 1| + \Lambda' H' \qquad (7)$$

where $|\lambda_0\rangle$ denotes the bias vector over all $n'$ output units, and $\langle 1|$ denotes a row vector with
'1' in every position. Thus, $\Lambda'$ is the array of weights in the final layer, not including the
biases, and $H'$ is the $n_0 \times P$ array of elements $[\phi_j^p]$. This nomenclature has been introduced
so as to separate out the effect of the bias vector. The reason for this is that the bias vector
compensates for the mean shift at the output of the layered network, and so it may be
removed, as follows.

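If we minimise $E$ with respect to the bias vector only (i.e. differentiate $E$ with respect
to $|\lambda_0\rangle$ and equate to zero) we find that

$$|\lambda_0\rangle = |\bar{T}\rangle - \Lambda' |m^H\rangle \qquad (8)$$

where

$$|\bar{T}\rangle = \frac{1}{P}\, T |1\rangle \qquad (9)$$

is the mean target vector with components

$$\bar{T}_k = \frac{1}{P} \sum_{p=1}^{P} t_{kp} \qquad (10)$$

and

$$|m^H\rangle = \frac{1}{P}\, H' |1\rangle \qquad (11)$$

is the mean output vector at the hidden units where the j-th component is

$$m_j^H = \frac{1}{P} \sum_{p=1}^{P} \phi_j^p \qquad (12)$$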


Thus,  if  the  square  error  is  minimised,  the  optimum  biases  ensure  that  the  mean  output 
of  the  network  equals  the  mean  target  pattern. 
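This property of the optimum biases is easy to check numerically. The sketch below uses synthetic hidden outputs and targets and an arbitrary (not optimised) final-layer weight matrix; all variable names are chosen here for illustration.

```python
import numpy as np

rng = np.random.default_rng(2)
P, n0, n_out = 30, 4, 3

H_prime = np.tanh(rng.normal(size=(n0, P)))      # hidden outputs, one column per pattern
T = rng.normal(size=(n_out, P))                  # target vectors, one column per pattern
Lam_prime = rng.normal(size=(n_out, n0))         # arbitrary final-layer weights, no biases

T_bar = T.mean(axis=1, keepdims=True)            # mean target vector
m_H = H_prime.mean(axis=1, keepdims=True)        # mean hidden-unit output vector
lam0 = T_bar - Lam_prime @ m_H                   # bias vector from equation (8)

outputs = lam0 + Lam_prime @ H_prime             # linear network outputs
mean_output = outputs.mean(axis=1, keepdims=True)

# Gradient of the squared error with respect to the biases: it vanishes at lam0
bias_gradient = 2.0 * (outputs - T).sum(axis=1)
```

The check holds for any choice of `Lam_prime`, which reflects the fact that the bias vector only compensates for the mean shift and carries no discriminatory information of its own.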

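Substituting equation (8) into equation (7) and then equation (7) into equation (4)
implies that the error to be minimised by the weights $\Lambda'$ may be expressed equivalently as

$$E = \left\| (T - |\bar{T}\rangle\langle 1|) + \Lambda' (|m^H\rangle\langle 1| - H') \right\|^2 = \| \hat{T} - \Lambda' \hat{H} \|^2 \qquad (13)$$

with the definitions

$$\hat{T} = T - |\bar{T}\rangle\langle 1|, \qquad \hat{H} = H' - |m^H\rangle\langle 1| \qquad (14)$$

Thus, we now wish to minimise $E$, which is a function of mean-shifted targets and hidden-unit
outputs, with respect to the parameters $\lambda_{jk}$, $\mu_{ij}$.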

It is known [8] that the matrix which minimises the error $E$ with minimum (Frobenius)
norm is given in terms of the Moore-Penrose pseudo-inverse, $\hat{H}^+$, of matrix $\hat{H}$. The pseudo-inverse $A^+$ of
the (rectangular) matrix $A$ is that unique matrix which satisfies the relationships

$$A A^+ A = A, \qquad A^+ A A^+ = A^+, \qquad (A A^+)^* = A A^+, \qquad (A^+ A)^* = A^+ A \qquad (15)$$
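In terms of this generalised matrix inverse, the solution of the weight matrix may be
expressed as

$$\Lambda' = \hat{T} \hat{H}^+ \qquad (16)$$

Since the error to be minimised is the trace of the matrix $(\hat{T} - \Lambda'\hat{H})(\hat{T} - \Lambda'\hat{H})^*$, substituting
in the solution for $\Lambda'$ gives, successively,

$$\begin{aligned} E &= \mathrm{Tr}\left\{ (\hat{T} - \Lambda'\hat{H})(\hat{T} - \Lambda'\hat{H})^* \right\} \\ &= \mathrm{Tr}\left\{ \hat{T}\hat{T}^* - \hat{T}\hat{H}^+\hat{H}\hat{T}^* - \hat{T}\hat{H}^*(\hat{H}^+)^*\hat{T}^* + \hat{T}\hat{H}^+\hat{H}\hat{H}^*(\hat{H}^+)^*\hat{T}^* \right\} \\ &= \mathrm{Tr}\left\{ \hat{T}\hat{T}^* - \hat{T}\hat{H}^*(\hat{H}\hat{H}^*)^+\hat{H}\hat{T}^* \right\} \end{aligned} \qquad (17)$$

To obtain this last relation, we have exploited the properties of the pseudo-inverse given
in (15) and the fact that the matrices $\hat{H}\hat{H}^+$ and $\hat{H}^+\hat{H}$ are idempotent.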
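The chain of equalities leading to the closed-form error can be verified numerically. The following sketch uses synthetic mean-shifted matrices and `np.linalg.pinv` for the Moore-Penrose pseudo-inverse; the variable names are assumptions made for this illustration.

```python
import numpy as np

rng = np.random.default_rng(3)
P, n0, n_out = 25, 4, 3

# Synthetic hidden outputs and targets, mean-shifted along the pattern axis
H_hat = rng.normal(size=(n0, P))
H_hat -= H_hat.mean(axis=1, keepdims=True)
T_hat = rng.normal(size=(n_out, P))
T_hat -= T_hat.mean(axis=1, keepdims=True)

H_pinv = np.linalg.pinv(H_hat)                   # shape (P, n0)
Lam_prime = T_hat @ H_pinv                       # weight solution via the pseudo-inverse

# Error computed directly from the residual
residual = T_hat - Lam_prime @ H_hat
E_direct = np.trace(residual @ residual.T)

# Error computed from the closed form involving the pseudo-inverse of H_hat H_hat^T
S_T = H_hat @ H_hat.T
E_closed = np.trace(T_hat @ T_hat.T) - np.trace(
    T_hat @ H_hat.T @ np.linalg.pinv(S_T) @ H_hat @ T_hat.T
)

# The matrix pinv(H_hat) @ H_hat is a symmetric, idempotent projector
projector = H_pinv @ H_hat
```

The two error values agree to rounding error, and the projector satisfies `projector @ projector == projector`, which is the idempotency used in the last step of the derivation.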

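Note that the matrix $(\hat{H}\hat{H}^*)$ is the Total Covariance Matrix, $S_T$, at the output of the
hidden units,

$$S_T = \hat{H}\hat{H}^* = \sum_{p=1}^{P} \left( |\phi^p\rangle - |m^H\rangle \right) \left( \langle\phi^p| - \langle m^H| \right) \qquad (18)$$

Thus, since the targets are fixed, minimising $E$ is equivalent to maximising the cost
function

$$C = \mathrm{Tr}\left\{ \hat{T}\hat{H}^* S_T^+ \hat{H}\hat{T}^* \right\} \qquad (19)$$

This is the Network Cost Function.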

If we consider the case where the total covariance matrix is full rank (and so the number
of independent patterns is greater than the number of hidden units) then the rank of $S_T$ is
$n_0$ and the pseudo-inverse is the true inverse,

$$S_T^+ = S_T^{-1} \quad \text{if } S_T \text{ is full rank.} \qquad (20)$$

In a linear problem, this would correspond to an overdetermined situation which does
not have a unique solution generally. However, this is usually the case which occurs in
pattern classification tasks where a sufficient number of representative samples need to be
considered so that noise effects superimposed on the 'clean' patterns may be compensated
for.


We  will  assume  that  the  total  covariance  matrix  is  full  rank  in  what  follows,  however 
the  actual  network  cost  function  (19)  is  not  restricted  by  this  assumption.  Under  the  full 
rank  restriction,  it  can  easily  be  shown  (see  Appendix  A)  that  maximising  the  Network 
Cost  Function  is  equivalent  to  maximising  the  cost  function 


$$C = \mathrm{Tr}\left\{ S_B S_T^{-1} \right\} \qquad (21)$$

where

$$S_B = \hat{H}\hat{T}^*\hat{T}\hat{H}^* \qquad (22)$$


Equations  (19)  and  (21)  are  the  principal  results  of  this  paper: 

Minimising the square error at the output of the adaptive layered network is equivalent
to maximising the Network Cost Function (19) at the outputs of the hidden
units. If the total covariance matrix is full rank then the Network Cost Function is
the trace of the product of a matrix $S_B$ and the inverse of the total covariance matrix
of the patterns at the outputs of the hidden units (21).
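A numerical sketch makes the equivalence concrete: for fixed targets, the minimum output error and the Network Cost Function always sum to the same constant, so a hidden representation with a larger cost necessarily has a smaller error. The data, the one-from-three targets and the helper names below are assumptions chosen for illustration.

```python
import numpy as np

def centre(M):
    """Subtract the mean over patterns (columns) from every row."""
    return M - M.mean(axis=1, keepdims=True)

def min_error_and_cost(hidden, targets):
    """Return (minimum output error, Network Cost Function) for given hidden outputs."""
    H_hat, T_hat = centre(hidden), centre(targets)
    S_T = H_hat @ H_hat.T                          # total covariance (unnormalised)
    S_B = H_hat @ T_hat.T @ T_hat @ H_hat.T
    cost = np.trace(S_B @ np.linalg.inv(S_T))      # assumes S_T is full rank
    Lam = T_hat @ np.linalg.pinv(H_hat)            # optimal output weights
    err = np.sum((T_hat - Lam @ H_hat) ** 2)
    return err, cost

rng = np.random.default_rng(4)
P = 40
labels = rng.integers(0, 3, size=P)
targets = np.eye(3)[:, labels]                     # one-from-3 targets, shape (3, P)

hidden_a = np.tanh(rng.normal(size=(4, P)))            # uninformative hidden outputs
hidden_b = np.tanh(rng.normal(size=(4, P)) + labels)   # outputs that depend on the class

total = np.trace(centre(targets) @ centre(targets).T)
err_a, cost_a = min_error_and_cost(hidden_a, targets)
err_b, cost_b = min_error_and_cost(hidden_b, targets)
```

Because `err + cost` equals `total` for both representations, comparing hidden layers by their minimum output error or by their cost function gives the same ordering.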

The following two subsections provide an interpretation of matrix $S_B$ for specific choices
of the output coding for the target vectors.

3.1 Particular target coding schemes.

Consider the specific choice of a one-from-n' coding scheme. Along with the other assumptions
made on the form of the adaptive layered network structure, the desired target
value of a particular pattern is unity if the chosen input pattern is in that class, and is zero
otherwise. If there are $n'$ classes, $C_k$, $k = 1, \ldots, n'$, with $n_k$ patterns in class $C_k$, then for
This equation is recognised as the expression for the weighted between class covariance
matrix.  Thus,  for  a  one-from-n'  output  coding  (which  is  very  common  in  the  literature)  the 
layered  network  maximises  a  cost  function  which  is  the  trace  of  the  product  of  the  weighted 
between  class  covariance  matrix,  and  the  inverse  of  the  total  covariance  matrix.  This  is  an 
interesting  result,  since  it  illustrates  how  adaptive  layered  networks  implicitly  incorporate 
the  proportions  of  samples  within  each  class  as  priors. 
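The weighted between-class form can be recovered from the definition of the matrix as a product of mean-shifted hidden outputs and targets. What follows is a sketch of that step under the definitions of this section, not a quotation of the authors' derivation; the hats denote mean-shifted matrices, and the symbol $|m_k^H\rangle$ for the mean hidden-unit output over class $C_k$ is notation assumed here.

```latex
% Mean-shifted matrices: \hat{H} = H' - |m^H\rangle\langle 1|, \hat{T} = T - |\bar{T}\rangle\langle 1|.
% Since \hat{H}|1\rangle = 0, the mean target drops out of the product:
\begin{align*}
\hat{H}\hat{T}^{*} &= \hat{H}T^{*} - \hat{H}|1\rangle\langle\bar{T}| = \hat{H}T^{*} \\
% One-from-n' coding: t_{kp} = 1 if pattern p belongs to class C_k, and 0 otherwise
(\hat{H}T^{*})_{jk} &= \sum_{p \in C_k} \left(\phi_j^p - m_j^H\right)
                     = n_k \left(m_{k,j}^H - m_j^H\right) \\
S_B = \hat{H}\hat{T}^{*}\hat{T}\hat{H}^{*}
    &= \sum_{k=1}^{n'} n_k^{2}
       \left(|m_k^H\rangle - |m^H\rangle\right)\left(\langle m_k^H| - \langle m^H|\right)
\end{align*}
```

Under this sketch each class enters with weight $n_k^2$, which is how the class proportions come to act as priors in the cost function.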

Consider an alternative coding scheme: the target of an input pattern is zero if the
pattern is not in the class under consideration and is the reciprocal of the square root of
the number of patterns in that class otherwise.

$$t_{kp} = \begin{cases} 1/\sqrt{n_k} & \text{if } |I\rangle^p \in C_k \\ 0 & \text{otherwise} \end{cases} \qquad (24)$$

In this case, the matrix $\tilde{S}_B$ expands to

$$\tilde{S}_B = \sum_{k=1}^{n'} n_k \left( |m_k^H\rangle - |m^H\rangle \right) \left( \langle m_k^H| - \langle m^H| \right) \qquad (25)$$

where $|m_k^H\rangle$ denotes the mean hidden-unit output vector over the patterns in class $C_k$. This
is the conventional (not weighted by priors) between class covariance matrix. Thus,
in a pattern classification problem which would be solved best by producing a pattern
distinction which uniformly weights the classes, an adaptive layered network trained on a
one-from-n'  coding  scheme  would  not  produce  the  best  results.  For  instance,  in  modelling 
continuous speech patterns the bulk of the acoustic vectors may represent silence. To ensure
that silence did not dominate the classification performance, but instead concentrated on
the more relevant information-bearing acoustic vectors, the adaptive layered network itself
should not incorporate prior knowledge on the frequency of occurrence of patterns within
each class for the between class covariance matrix. In order to force the network weights
not to bias in favour of the classes with largest membership, the prior knowledge of pattern
distribution has to be encoded in the target vectors. In such experiments, unevenly distributed
training patterns between classes may be compensated for by scaling the 1-from-n'
target vectors by the reciprocal of the square root of the number in each class, as in (24).


Two particular instances where the distinction between the weighted and not-weighted
between class covariance matrices will not be made occur when the number of patterns in
each class is the same and in a two-class problem. In this latter case, the weighted between
class covariance matrix, $S_B$, and the conventional between class covariance matrix, $\tilde{S}_B$, are
connected by a multiplicative constant.

$$S_B = \frac{2 n_1 n_2}{P}\, \tilde{S}_B$$

Thus,  maximising  the  cost  function  with  the  weighted  covariance  will  give  the  same  result 
as  maximising  with  the  conventional  covariance  matrix  for  a  two  class  problem. 
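The relation between the two target codings and the two between-class matrices can be checked numerically. The sketch below assumes the weighted matrix carries a factor of the squared class size and the conventional matrix a factor of the class size, as follows from the definitions of the mean-shifted matrices; the data and function names are hypothetical.

```python
import numpy as np

def centre(M):
    """Subtract the mean over patterns (columns) from every row."""
    return M - M.mean(axis=1, keepdims=True)

def s_b_from_targets(hidden, targets):
    """Between-class-type matrix built from mean-shifted hidden outputs and targets."""
    H_hat, T_hat = centre(hidden), centre(targets)
    return H_hat @ T_hat.T @ T_hat @ H_hat.T

def between_class_matrices(hidden, labels):
    """Return (weighted, conventional) between-class matrices from class means."""
    m_all = hidden.mean(axis=1)
    n0 = hidden.shape[0]
    weighted, conventional = np.zeros((n0, n0)), np.zeros((n0, n0))
    for k in np.unique(labels):
        members = hidden[:, labels == k]
        n_k = members.shape[1]
        d = members.mean(axis=1) - m_all
        weighted += n_k ** 2 * np.outer(d, d)      # class size squared
        conventional += n_k * np.outer(d, d)       # class size only
    return weighted, conventional

rng = np.random.default_rng(5)

# Three classes with unequal membership
labels3 = np.repeat([0, 1, 2], [5, 12, 8])
hidden3 = rng.normal(size=(4, labels3.size))
one_hot = np.eye(3)[:, labels3]
scaled = one_hot / np.sqrt(one_hot.sum(axis=1, keepdims=True))   # 1/sqrt(n_k) coding
w3, c3 = between_class_matrices(hidden3, labels3)

# Two classes: the matrices differ only by the constant 2 * n1 * n2 / P
labels2 = np.repeat([0, 1], [7, 13])
hidden2 = rng.normal(size=(4, labels2.size))
w2, c2 = between_class_matrices(hidden2, labels2)
factor = 2 * 7 * 13 / labels2.size
```

With three unequal classes the two matrices are not proportional, which is why the choice of target coding matters in the multi-class case.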


3.2  The  Linear  Adaptive  Layered  Network. 


The final special case to consider is when the hidden units are constrained to a linear
transfer function of the fan-in ($\phi_j(x) = x \;\; \forall j$). In this case, the adaptive layered network
as a whole performs a linear transformation between the input and output spaces. This was
the situation considered theoretically by Gallinari et al. [4]. The result which they proved
(under certain reasonable assumptions) was that the weight matrix between the input and
hidden units which minimised the square error of the network also maximised the cost
function

$$\mathcal{C}(W') = \frac{|W' S_B^I W'^*|}{|W' S_T^I W'^*|} \qquad (26)$$


which is the ratio of the determinants of the between class and total covariance matrices in
the transformed space of the input patterns. In this equation, $S_B^I$ and $S_T^I$ are the between
class and total covariance matrices of the original input data. It was not clear what other
types of cost function would be maximised also by the same matrix $W'$, although the above
choice is a reasonable one to make for discrimination analysis. However, this choice is not
the natural cost function which the network is implicitly attempting to maximise, as we
have illustrated.

It is interesting to see if the result presented in equation (26) can reproduce the maximisation
of $C(W')$ (21) in the limit of a linear network. It will be shown that this is indeed
the case. The first part of the illustration removes the effects of the biases by demonstrating
that the hidden unit bias vector compensates for the difference between the mean vector
over all patterns at the output of the hidden units and the transform of the mean of all the
input patterns. The second part demonstrates that the matrix equation satisfied by the
weights which maximise the cost function $\mathcal{C}$ is almost the same as the equation for
the matrix which maximises the cost function $C$. Thus, the explicit solution for the weights
which maximises $\mathcal{C}$ will also maximise $C$.

If the network is linear, then the output matrix of the hidden units, $H'$, which is of size
$n_0 \times P$, is obtained by a linear transformation of the $(n + 1) \times P$ array of input patterns, $I$,
by the $n_0 \times (n + 1)$ weight matrix $W$ (note that $W$ and $I$ include the effect of the biases,
$|\mu_0\rangle$),

$$H' = W I$$

As illustrated previously, the effect of the biases may be separated by decomposing the
matrix $H'$ into

$$H' = W' I' + |\mu_0\rangle\langle 1| \qquad (27)$$

where $W'$ is of dimension $n_0 \times n$, $I'$ is of dimension $n \times P$, $|\mu_0\rangle$ is an $n_0$ dimensional column
vector and $\langle 1|$ is a $P$ dimensional unit constant row vector.

Since $H'|1\rangle = P|m^H\rangle$, using the above equation gives the result that

$$|\mu_0\rangle = |m^H\rangle - W' |m^I\rangle \qquad (28)$$

where $|m^I\rangle$ is the mean input pattern.
Thus, in the case of linear hidden units, the role of the hidden unit bias vector is to compensate
for the difference between the actual mean of the output patterns of the hidden units, and
the linear transformation of the mean of the input patterns.

Substituting back for $|\mu_0\rangle$ in (27) allows the bias to be removed by considering mean-shifted
input patterns and hidden unit outputs:

$$H' - |m^H\rangle\langle 1| = W' I' - W' |m^I\rangle\langle 1|, \quad \text{i.e.} \quad \hat{H} = W' \hat{I} \qquad (29)$$

where

$$\hat{H} = H' - |m^H\rangle\langle 1|, \qquad \hat{I} = I' - |m^I\rangle\langle 1| \qquad (30)$$

In terms of the rescaled input patterns and hidden unit outputs, the matrices $S_B$ and
$S_T$ may be expressed as

$$S_B = W' \hat{I}\hat{T}^*\hat{T}\hat{I}^* W'^* = W' S_B^I W'^* \qquad (31)$$

$$S_T = W' \hat{I}\hat{I}^* W'^* = W' S_T^I W'^* \qquad (32)$$

where $S_B^I$ and $S_T^I$ are the 'between class' (provided the targets are appropriately chosen) and
total covariance matrices of the input patterns. Thus, in terms of these matrices associated
with the input data, the cost function that the network is attempting to maximise is

$$C = \mathrm{Tr}\left\{ W' S_B^I W'^* \left( W' S_T^I W'^* \right)^{-1} \right\} \qquad (33)$$

which is performed by an appropriate choice of the weight matrix $W'$ (the links between
the input and hidden units).
The next part of this illustration is to show that any matrix which maximises the cost
function

$$\mathcal{C}(W') = \frac{|W' S_B^I W'^*|}{|W' S_T^I W'^*|} \qquad (34)$$

also maximises the cost function $C(W')$ (33).


Taking the derivative of the cost function $C(W')$ with respect to the elements of matrix
$W'$ and equating to zero gives the matrix equation

$$2 \left( W' S_T^I W'^* \right)^{-1} \left[ W' S_B^I - W' S_B^I W'^* \left( W' S_T^I W'^* \right)^{-1} W' S_T^I \right] = 0 \qquad (35)$$

Similarly, the derivative of the cost function $\mathcal{C}(W')$ with respect to $W'$ equated to zero
gives the matrix equation:

$$\frac{|W' S_B^I W'^*|}{|W' S_T^I W'^*|} \left( W' S_B^I W'^* \right)^{-1} \left[ W' S_B^I - W' S_B^I W'^* \left( W' S_T^I W'^* \right)^{-1} W' S_T^I \right] = 0 \qquad (36)$$

The derivation of this equation has assumed that the rank of $S_B^I$ is at least $n_0$ so that
the inverse (and hence the derivative of the determinant) exists.

By comparing equations (35) and (36), it is evident that a nontrivial solution for matrix
$W'$ which satisfies (36) must also satisfy (35). In particular, any $W'$ which satisfies the
generalised eigenvalue equation,

$$W' S_B^I = \Gamma\, W' S_T^I \qquad (37)$$

where $\Gamma$ is a diagonal matrix of eigenvalues, satisfies (36) and maximises the cost function
$\mathcal{C}$ with a value of $|\Gamma|$ which is the product of the eigenvalues. This solution also satisfies
(35) and maximises the Network Cost Function, $C$, with a value of $\mathrm{Tr}\,\Gamma$ which is the sum of
the eigenvalues.

Thus $W'$ may be composed out of the eigenvectors of $S_B^I (S_T^I)^{-1}$ corresponding to the
non-zero eigenvalues (giving a specific solution $W_0$). Note that any linear, invertible transformation,
$V$, of $W_0$, will also maximise these two cost functions; hence the solution is not
unique.
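The eigenvector solution and its non-uniqueness can be illustrated numerically. The sketch below uses synthetic symmetric matrices in place of the input-space between-class and total covariance matrices, and solves the generalised eigenproblem through a Cholesky reduction so that only NumPy is needed; the names and dimensions are assumptions for this illustration.

```python
import numpy as np

rng = np.random.default_rng(6)
n, n0 = 6, 3

# Synthetic stand-ins: a rank-3 'between class' matrix and a positive definite 'total' matrix
A = rng.normal(size=(n, 3))
S_B = A @ A.T
B = rng.normal(size=(n, 20))
S_T = B @ B.T

# Reduce S_B v = gamma S_T v to a symmetric problem using S_T = L L^T
L = np.linalg.cholesky(S_T)
L_inv = np.linalg.inv(L)
gamma, U = np.linalg.eigh(L_inv @ S_B @ L_inv.T)
top = np.argsort(gamma)[::-1][:n0]                # indices of the n0 largest eigenvalues
Gamma = gamma[top]
W0 = (L_inv.T @ U[:, top]).T                      # rows are generalised eigenvectors

def trace_cost(W):
    """Trace form of the cost function for weight matrix W."""
    return np.trace(W @ S_B @ W.T @ np.linalg.inv(W @ S_T @ W.T))

def det_cost(W):
    """Ratio-of-determinants form of the cost function for weight matrix W."""
    return np.linalg.det(W @ S_B @ W.T) / np.linalg.det(W @ S_T @ W.T)

V = rng.normal(size=(n0, n0)) + 3.0 * np.eye(n0)  # an invertible mixing matrix
W_random = rng.normal(size=(n0, n))               # an arbitrary competitor
```

Here `trace_cost(W0)` equals the sum of the selected eigenvalues and `det_cost(W0)` their product, and both values are unchanged when `W0` is replaced by `V @ W0`.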
However, there exist solutions of (36) which do not maximise the cost function $\mathcal{C}$ but
which do still maximise the Network Cost Function $C$. For instance, the matrix $W' =
V W_0 + M$, where the columns of $M^*$ lie in the null space of $S_B^I$ (so that $S_B^I M^* = 0$), still
satisfies the error minimisation equations, but it does not maximise the cost function $\mathcal{C}$.
However, the Network Cost Function is not altered by the addition of such a null subspace
matrix. Therefore, there exist solutions which minimise the network error and maximise the
network cost function, but which do not maximise the cost function proposed by Gallinari
et al. Note that if a minimum norm solution is demanded, then matrix $W'$ must lie entirely
within the image of $S_B^I$ and both cost functions are maximised simultaneously.

4  Numerical  Illustration. 

In  this  section,  the  implications  of  the  theorem  in  Section  3  are  illustrated.  The  problem  is 
to determine the number of connected groups of 1's in an 8-bit binary string (the 'Contiguity
Problem’).  Thus,  there  are  256  distinct  patterns,  each  of  which  belongs  to  one  of  5  classes. 
Table  1  gives  the  number  of  members  of  each  class.  Note  that  this  is  not  a  ‘typical’  problem; 
there  is  no  true  concept  of  noise  in  the  data  and,  being  a  Boolean  problem,  it  does  not  really 
make  sense  to  discuss  its  statistics  in  terms  of  covariance  matrices.  Nevertheless,  it  is  a 
simple  problem  with  more  than  two  classes  to  discriminate  and  with  a  disparate  number  of 
patterns  in  each  class.  Despite  the  lack  of  an  intuitive  interpretation  of  covariance  matrices 
for  such  a  problem,  it  should  still  be  true  that  the  Network  Cost  Function  is  maximised  - 
the  issue  is  not  what  value  the  Network  Cost  Function  attains  (which  will  presumably  be 
larger  for  Gaussian  distributed  input  patterns)  but  whether  the  value  that  it  does  reach 
is  the  largest  possible  value  consistent  with  the  problem  and  the  network.  Thus  we  have 
chosen  this  problem  to  be  illustrative  of  the  general  results  we  have  derived. 
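The class structure of the contiguity problem is easy to enumerate. The following sketch counts the maximal runs of 1's in every 8-bit string and tallies how many strings fall into each class; the function name is chosen here for illustration.

```python
from collections import Counter

def count_groups(bits):
    """Number of maximal runs of consecutive 1's in a bit string such as '01101110'."""
    return sum(1 for run in bits.split("0") if run)

# Tally all 256 eight-bit patterns by their number of connected groups
counts = Counter(count_groups(format(value, "08b")) for value in range(256))
class_sizes = [counts[groups] for groups in range(5)]
print(class_sizes)  # [1, 36, 126, 84, 9]
```

The five classes are strongly unbalanced, which is the property that makes the distinction between the weighted and conventional between-class matrices visible in this example.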

The  specific  nonlinear  transformation  from  the  input  layer  to  the  hidden  layer  employed 
in  this  test  is  that  transformation  as  determined  by  a  multi-layer  perceptron  with  8  input 
units,  5  output  units  and  a  varying  number  of  hidden  units  as  the  classification  network. 
The  output  units  have  a  linear  transfer  function.  The  output  coding  scheme  adopted  is  a 
one-from-five coding, so that the matrix $S_B$ is given by equation (23). Since there are an
unequal number of members in each class, this matrix is not proportional to the conventional
between class covariance matrix, $\tilde{S}_B$ (25).

Class    Number of connected groups    Number of members in class
  1                  0                             1
  2                  1                            36
  3                  2                           126
  4                  3                            84
  5                  4                             9

Table 1: Numbers of members in each class for the connected groups of digit 1's

Four cost functions were evaluated at the output of the hidden units, namely the Network
Cost Function, $C = \mathrm{Tr}(S_B S_T^{-1})$; the cost function $C = \mathrm{Tr}(\tilde{S}_B S_T^{-1})$; and the ratios of
determinants $|S_B|/|S_T|$ and $|\tilde{S}_B|/|S_T|$. The method of solution of the least squares problem
uses an iterative scheme to minimise the error (see Appendix B) and the cost functions are
evaluated at each stage of the iteration.
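As an illustration of how these four quantities might be evaluated for a given set of hidden-unit patterns, the sketch below uses NumPy. The normalisation is an assumption on our part: class k enters the conventional between-class matrix with weight p_k = n_k/P and the weighted matrix with weight p_k squared, which may differ from equations (23) and (25) by constant factors. The function names and the synthetic data are hypothetical.

```python
import numpy as np


def scatter_matrices(hidden, labels):
    """Return (S_T, weighted S_B, conventional S_B) for hidden-unit patterns.

    hidden: (P, n0) array, one row per pattern; labels: (P,) class indices.
    Assumed normalisation: weights p_k = n_k / P (conventional) and
    p_k**2 (weighted); the paper's own constants may differ.
    """
    hidden = np.asarray(hidden, dtype=float)
    labels = np.asarray(labels)
    n_patterns, n_hidden = hidden.shape
    mean = hidden.mean(axis=0)
    centred = hidden - mean
    s_total = centred.T @ centred / n_patterns
    s_weighted = np.zeros((n_hidden, n_hidden))
    s_conventional = np.zeros((n_hidden, n_hidden))
    for k in np.unique(labels):
        members = hidden[labels == k]
        p_k = members.shape[0] / n_patterns
        offset = members.mean(axis=0) - mean      # class mean minus global mean
        outer = np.outer(offset, offset)
        s_conventional += p_k * outer
        s_weighted += p_k ** 2 * outer
    return s_total, s_weighted, s_conventional


def cost_functions(hidden, labels):
    """Evaluate the two trace costs and the two determinant ratios."""
    s_t, s_bw, s_bc = scatter_matrices(hidden, labels)
    s_t_pinv = np.linalg.pinv(s_t)                # equals the inverse if S_T is full rank
    det_t = np.linalg.det(s_t)
    return {
        "trace_weighted": np.trace(s_bw @ s_t_pinv),
        "trace_conventional": np.trace(s_bc @ s_t_pinv),
        "det_ratio_weighted": np.linalg.det(s_bw) / det_t,
        "det_ratio_conventional": np.linalg.det(s_bc) / det_t,
    }


# Synthetic stand-in for hidden-unit outputs: 5 classes, 4 hidden units.
rng = np.random.default_rng(0)
labels = rng.integers(0, 5, size=200)
centres = rng.normal(size=(5, 4))
hidden = centres[labels] + 0.3 * rng.normal(size=(200, 4))
print(cost_functions(hidden, labels))
```

In an actual experiment `hidden` would hold the hidden-unit outputs at the current iteration, so that the four values can be recorded at each step.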

Although this is a rather artificial problem, it is one which cannot be solved by a linear
transformation from input space to output space, or equivalently, a network with four linear
transfer functions at the hidden layer [4]. Performing a least-mean-squares mapping from
the  8  input  units  directly  to  the  5  output  units,  and  classifying  the  patterns  according  to 
the  minimum  Euclidean  distance  in  the  output  space  gives  132  (=  51.56%)  correct  solutions 
(and  one  indeterminate  solution  since  the  null  vector  maps  on  to  the  null  vector,  which  is 
equi-distant  from  all  classes).  A  network  with  four  (linear)  hidden  units  achieves  the  same 
performance,  though  the  addition  of  the  biases  does  enable  the  null  vector  to  be  classified 
correctly.  Figure  1  plots  the  cost  functions  as  a  function  of  iteration  number  in  the  error 
minimisation  routine  (in  fact,  for  this  example,  we  have  chosen  to  use  an  inefficient  algorithm 
-  steepest  descents  -  to  solve  for  the  parameters  of  the  network,  since  the  BFGS  routine 
used  in  the  nonlinear  problems  converged  in  too  few  iterations  to  illustrate  the  problem). 
The  figure  shows  that  both  the  trace  cost  functions  reach  a  maximum  at  the  end  of  the 
iteration  whilst  the  ratio  of  determinants  cost  function  reaches  a  maximum  after  about  20 
iterations  and  then  starts  to  decline.  This  is  because  the  solution  for  the  matrix  of  weights 
connecting the input units to the hidden units is not a minimum norm solution since it has
components in the null space of the between class covariance matrix $\tilde{S}_B$.

Introducing  a  nonlinear  transfer  function  at  each  of  the  hidden  units  gives  improved 
classification  performance.  Figure  2  plots  the  cost  functions  as  a  function  of  iteration 
number  for  a  network  with  four  hidden  units.  For  the  particular  random  start  configuration 
of  the  weights  and  biases  chosen,  the  network  achieved  189  (73.83%)  correct  solutions. 
The network cost function increases monotonically with the number of iterations of the
algorithm. The sum-squared error at the output decreases monotonically correspondingly.
However, the cost function $C = \mathrm{Tr}(\tilde{S}_B S_T^{-1})$ settles at a value which is not its peak value
during the iteration. Neither of the cost functions which depict the ratio of determinants of
the between class and total covariance matrices is maximised. In fact, in the situation
where the number of hidden units is equal to one less than the number of classes ($n_0 = n' - 1$),
as illustrated in this example, the determinants $|S_B|$ and $|\tilde{S}_B|$ are related by

|S*|  =  n’P^|Sa|  (38) 

Thus  both  cost  functions  exhibit  the  same  behaviour,  as  observed  in  the  figures. 
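The proportionality of the two determinants can be checked numerically. The sketch below assumes a particular normalisation that is ours, not necessarily the paper's: class weights p_k = n_k/P in the conventional matrix and p_k squared in the weighted one. Under that assumption the constant of proportionality works out to n' times the product of the p_k; with a different normalisation the constant changes but the proportionality does not. The hidden-unit patterns are synthetic.

```python
import numpy as np

rng = np.random.default_rng(1)
n_classes, n_hidden = 5, 4                   # n0 = n' - 1, as in the text
sizes = np.array([1, 36, 126, 84, 9])        # class populations from Table 1
labels = np.repeat(np.arange(n_classes), sizes)

# Hypothetical hidden-unit patterns: random class centres plus noise.
centres = rng.normal(size=(n_classes, n_hidden))
hidden = centres[labels] + 0.1 * rng.normal(size=(labels.size, n_hidden))

p = sizes / sizes.sum()
mean = hidden.mean(axis=0)
# Row k holds the offset of the class-k mean from the global mean.
offsets = np.stack(
    [hidden[labels == k].mean(axis=0) - mean for k in range(n_classes)]
)
s_conventional = (offsets.T * p) @ offsets        # sum_k p_k d_k d_k^T
s_weighted = (offsets.T * p ** 2) @ offsets       # sum_k p_k^2 d_k d_k^T

ratio = np.linalg.det(s_weighted) / np.linalg.det(s_conventional)
predicted = n_classes * np.prod(p)                # constant under the assumed weights
print(ratio, predicted)
```

Because the ratio does not depend on the hidden-unit patterns themselves, the two determinant-ratio cost functions differ only by a fixed scale and therefore trace out the same curve.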

With  a  nonlinear  network,  we  are  not  restricted  to  having  fewer  hidden  units  than  the 
number  of  classes  as  in  linear  discriminant  analysis  and  Figure  3  plots  the  cost  functions 
as  a  function  of  iteration  number  for  a  network  with  6  hidden  units.  With  this  number  of 
hidden  units,  the  determinant  of  the  between-class  covariance  matrix  is  identically  equal  to 
zero since the dimension of the matrix $S_B$ is greater than the number of classes. Therefore,
it is not meaningful to use the ratio-of-determinants cost function as a measure of
classification performance. Consequently, an additional cost function, $\mathrm{Tr}\{S_B\}/\mathrm{Tr}\{S_T\}$, the
ratio  of  traces  of  the  between  class  and  total  covariance  matrices  has  also  been  plotted  for 
comparison.  Note  that  for  the  particular  random  start  configuration  used  for  this  figure, 
the  network  achieved  192  (=  75%)  correctly  classified  solutions. 

This figure shows that the Network Cost Function is maximised as the error is minimised
but the trace of the product of the conventional between class covariance matrix and the
inverse of the total covariance matrix is not maximised. For a larger number of hidden
units we find, for this particular problem, that the matrix $S_T$ becomes singular during the
iteration and therefore the more general form of the network cost function (19) must be
used.

Figure 1: Plots of the network cost function $\mathrm{Tr}(S_B S_T^{-1})$ and the other cost functions as a
function of iteration number, for the linear network.

Figure 2: Plots of the network cost function $\mathrm{Tr}(S_B S_T^{-1})$ divided by $\|T\|^2$ (solid line), the
cost function $\mathrm{Tr}(\tilde{S}_B S_T^{-1})$ divided by $\|T\|^2$ (dotted line), the ratio of determinants $|S_B|/|S_T|$
(dot-dash), and the ratio of determinants $|\tilde{S}_B|/|S_T|$ multiplied by $10^6$
(dashed line) as a function of iteration number in a least-squares minimisation routine, for
a network with four hidden units with nonlinear transfer functions.

These  three  figures  illustrate  a  natural  evolution  of  discriminant  analysis  strategies.  The 
linear  network  produces  an  optimum  linear  transformation  to  a  dimension  reducing  subspace 
where  the  patterns  corresponding  to  different  classes  are  in  some  sense  maximally  separated, 
and  the  patterns  within  each  class  are  grouped  (this  is  the  example  illustrated  by  Figure  1). 
The  next  step  is  to  allow  for  a  nonlinear  transformation  on  to  a  dimension  reducing  subspace 
which  should  have  the  advantage  of  providing  a  better  class  discrimination  transformation 
(the  example  illustrated  by  Figure  2).  The  final  stage  (Figure  3)  is  to  allow  for  an  embedding 
of  the  input  patterns  by  a  nonlinear  transformation  to  a  higher  dimensional  space  where  an 
even  better  class  separation  may  be  achieved.  Once  a  transformation  has  been  performed 
into  a  space  where  the  transformed  patterns  are  more  easily  distinguished,  it  is  much  easier 
for  a  linear  discrimination  (the  hidden-output  layer  of  the  multi-layer  perceptron  considered 
in  the  paper)  to  perform  good  classification.  These  general  comments  are  reflected  in  the 
classification  performance  of  the  figures  which  rises  from  132  to  192  patterns  classified 
correctly.  However,  note  that  in  all  instances,  the  criterion  for  maximal  class  separation  is 
determined  by  the  Network  Cost  Function.  This  may  not  be  the  best  criterion  to  choose  for 
a  general  discrimination  problem,  but  it  is  the  only  one  that  such  an  adaptive  feed-forward 
layered  network  can  employ. 

5  Conclusion. 

Adaptive  feed-forward  layered  networks  are  capable  of  performing  classification  tasks  better 
than  traditional  methods.  This  paper  has  demonstrated  that  this  ability  arises  out  of  the 
implicit  way  in  which  a  Network  Cost  Function  is  maximised  in  the  space  of  the  hidden 
units. 

Figure 3: Plots of the network cost function $\mathrm{Tr}(S_B S_T^{-1})$ divided by $\|T\|^2$ (solid line),
the cost function $\mathrm{Tr}(\tilde{S}_B S_T^{-1})$ divided by $\|T\|^2$ (dotted line) and the ratio of traces
$\mathrm{Tr}\{S_B\}/\mathrm{Tr}\{S_T\}$ multiplied by a factor of 10 (dashed line) as a function of iteration number
in a least-squares minimisation routine, for a network with six hidden units with nonlinear
transfer functions.

Specifically, this paper considered a general nonlinear transformation from the input
patterns to a set of patterns in the space defined by the final layer of hidden units (there is
no restriction on the number of layers constituting the nonlinear transformation) followed
by a linear transformation to a set of output target patterns. If the network weights are
adjusted to minimise the mean-square error between the desired target patterns and the
actual output patterns of the network, then this is equivalent to maximising the Network
Cost Function

$$C = \mathrm{Tr}\{T H^{*} S_T^{+} H T^{*}\} \qquad (39)$$

where $T$ and $H$ are defined in equation (14) and $S_T$ is the total covariance matrix of the
patterns at the outputs of the final hidden layer.

If  this  total  covariance  matrix  is  full  rank  (which  is  usually  the  case)  maximising  the 
Network  Cost  Function  is  equivalent  to  maximising  the  cost  function 

$$C = \mathrm{Tr}\{S_B S_T^{-1}\} \qquad (40)$$
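The equivalence between minimising the output error and maximising this cost function can be illustrated numerically. The sketch below rests on assumptions of ours about the definitions in equation (14): that $T$ and $H$ are the mean-centred target and hidden-unit matrices with one column per pattern, that the asterisk denotes the transpose, and that $S_T = HH^{*}$ and $S_B = HT^{*}TH^{*}$ up to constant factors. The hidden-unit data are synthetic.

```python
import numpy as np

rng = np.random.default_rng(0)
n_patterns, n_hidden, n_classes = 256, 4, 5
labels = rng.integers(0, n_classes, size=n_patterns)

# Hypothetical hidden-unit outputs (columns are patterns), mean-centred.
H = rng.normal(size=(n_hidden, n_patterns)) + 0.3 * labels
H = H - H.mean(axis=1, keepdims=True)
# 1-from-n' target patterns, mean-centred in the same way.
T = np.eye(n_classes)[labels].T
T = T - T.mean(axis=1, keepdims=True)

S_T = H @ H.T                       # total covariance (assumed unnormalised)
S_B = H @ T.T @ T @ H.T             # assumed form of the matrix in (40)

cost_39 = np.trace(T @ H.T @ np.linalg.pinv(S_T) @ H @ T.T)
cost_40 = np.trace(S_B @ np.linalg.inv(S_T))

# Least-squares optimal linear output layer for these hidden patterns.
output_weights = T @ np.linalg.pinv(H)
min_error = np.sum((T - output_weights @ H) ** 2)

# The three printed values agree: error_min = ||T||^2 - C.
print(cost_39, cost_40, np.sum(T ** 2) - min_error)
```

Since $\|T\|^2$ is fixed by the choice of targets, any change to the hidden layer that lowers the minimum output error must raise the cost function by the same amount.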

The matrix $S_B$ may be interpreted to be the weighted between class covariance matrix
at the output of the final hidden layer if the target patterns are chosen as a 1-from-n'
coding. Equivalently, encoding the distribution of patterns between the classes into the
target patterns (which is equivalent to weighting the error minimisation) allows the matrix
$S_B$ to be interpreted as the conventional between class covariance matrix.

The  action  of  a  feed-forward  network  does  not  maximise  more  traditional  cost  functions 
employed  in  discrimination  analysis,  as  our  numerical  example  illustrated.  However  in  the 
special  case  of  a  totally  linear  network  with  one  hidden  layer,  the  minimum  norm  solution 
which  maximises  the  ratio  of  determinants  of  the  between  class  and  total  covariance  matrices 
(the result obtained by Gallinari et al. [4]) is equivalent to maximising the Network Cost
Function. 

Thus an adaptive feed-forward layered network performs a natural generalisation of linear
discriminant analysis by implicitly maximising a cost function relating the between class
and total covariance matrices. This is precisely why such networks have been demonstrated
to perform classification tasks well. It should be possible to force such networks to
maximise alternative network cost functions more appropriate to a specific task, by minimising
error measures other than the mean-square error considered in this paper. Alternatively,
given a cost function which it is desired to maximise explicitly, what error function should
be minimised, by first performing a linear transformation to a classification space, so as
to achieve the same effect implicitly? In this sense, the results of this paper may have
a wider applicability in 'designer networks' for specific applications.
A  Proof of the cost function equation.

In section 3, it was shown how minimising the error was equivalent to maximising the cost
function in equation (19), i.e.

$$C = \mathrm{Tr}\{T H^{*} S_T^{+} H T^{*}\} = \mathrm{Tr}\, D \qquad (41)$$



If we consider the eigenvalue equation of matrix $D$

$$D|x\rangle = q|x\rangle \qquad (42)$$

and operate through the equation on the left by the matrix $HT^{*}$, one finds

$$(HT^{*}TH^{*})S_T^{+}|y\rangle = q|y\rangle \qquad (43)$$

This is an eigenvalue equation for the matrix

$$E = HT^{*}TH^{*}S_T^{+}$$

with different eigenfunctions, $|y\rangle = HT^{*}|x\rangle$.

All the eigenvalues of $D$ are also eigenvalues of $E$ (but note that $E$ may have additional
eigenvalues to $D$). Since the trace of a matrix is the sum of its eigenvalues, then the trace of
$D$ is less than or equal to the trace of matrix $E$. In the case of a full rank total covariance
matrix, $S_T$, then $E$ has the same rank as $D$. Thus, $E$ has the same number of eigenvalues as
$D$ and so $\mathrm{Tr}\,D = \mathrm{Tr}\,E$. Consequently, maximising the Network Cost Function is equivalent
to maximising the cost function given in equation (21), as desired. $\square$

Provided  that  the  total  covariance  matrix  is  full  rank,  then  the  pseudo-inverse  equals  the 
true  inverse  and  the  above  conclusions  may  be  reversed:  maximising  the  cost  function  (21) 
is  equivalent  to  maximising  the  cost  function  (19).  Thus  either  of  the  cost  functions  (19), 
(21)  may  be  taken  to  be  the  Network  Cost  Function.  This  is  often  the  case  in  practice. 
Unfortunately, this conclusion does not apply if the total covariance matrix $S_T$ is not
full rank since there will exist eigenvalues of $E$ which are not eigenvalues of $D$ and so
$\mathrm{Tr}\,E > \mathrm{Tr}\,D$.
B  Numerical Solution of the Least-squares Problem.

In the numerical example used to illustrate the theorem, the network employed had a
single hidden layer with the output nonlinearity of the hidden units described by a logistic
function,

$$\phi(x) = \frac{1}{1 + \exp(-x)} \qquad (44)$$

and an output layer employing linear units.


The  square  error  at  the  output  of  the  network  is  regarded  as  a  nonlinear  function  of 
the  weights  and  biases  between  the  input  layer  and  the  hidden  layer.  This  error  may  be 
minimised using any suitable nonlinear function optimisation strategy [2], and we have
chosen  to  use  a  quasi-Newton  technique,  the  BFGS  method. 


The minimisation proceeds as follows. Given an initial estimate, $\{|\mu(t = 0)\rangle\}$, of the
weights and biases $\{|\mu\rangle\}$ (chosen from a uniform random distribution in the interval (-1,1))
between the input and hidden layers, the final layer weights and biases, $\{|\lambda\rangle\}$, are calculated
using equations (16) and (8) and the value of the output error obtained. The gradient of
the error with respect to the parameters $\{|\mu\rangle\}$ may then be calculated from Equation (3).

Thus,  given  an  initial  position  and  an  initial  search  direction  (taken  to  be  the  direction 
of  the  downhill  gradient),  the  algorithm  performs  a  search  along  this  direction  to  obtain  an 
estimate  of  the  minimum  of  the  error  in  this  direction.  Once  this  has  been  achieved,  a  new 
search  direction  is  generated  (using  the  BFGS  prescription)  and  a  search  performed  to  find 
the  minimum  of  the  error  in  this  new  direction.  This  procedure  continues  until  convergence. 
Note that each time that the error is evaluated (for each new estimate of the parameters
$\{|\mu\rangle\}$) the values of the parameters $\{|\lambda\rangle\}$ must be obtained using equations (16) and (8)
prior to evaluation of the error. In this way, the values of the parameters $\{|\lambda\rangle\}$ are tied
to the values of $\{|\mu\rangle\}$. This ensures that the method produces a global minimum in the
subspace spanned by the parameters $\{|\lambda\rangle\}$.
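The tying of the final-layer parameters to the hidden-layer parameters can be sketched as an error function of the input-to-hidden parameters alone. This is a minimal illustration rather than the authors' code: `numpy.linalg.lstsq` with an appended bias column stands in for equations (16) and (8), and the function and variable names are hypothetical.

```python
import numpy as np


def logistic(x):
    """Logistic transfer function of the hidden units."""
    return 1.0 / (1.0 + np.exp(-x))


def output_error(hidden_params, inputs, targets, n_hidden):
    """Sum-squared output error as a function of the input-to-hidden
    parameters only; the linear output layer is re-solved on every call."""
    n_inputs = inputs.shape[1]
    weights = hidden_params[: n_inputs * n_hidden].reshape(n_inputs, n_hidden)
    biases = hidden_params[n_inputs * n_hidden:]
    hidden = logistic(inputs @ weights + biases)               # (P, n_hidden)
    design = np.hstack([hidden, np.ones((hidden.shape[0], 1))])  # bias column
    final_layer, *_ = np.linalg.lstsq(design, targets, rcond=None)
    residual = targets - design @ final_layer
    return float(np.sum(residual ** 2)), final_layer


# Inputs: all 256 eight-bit patterns; targets: a 1-from-5 coding of the
# number of connected groups of 1's.
patterns = np.arange(256)
inputs = ((patterns[:, None] >> np.arange(8)) & 1).astype(float)
padded = np.hstack([np.zeros((256, 1)), inputs])
groups = np.sum((padded[:, 1:] == 1) & (padded[:, :-1] == 0), axis=1)
targets = np.eye(5)[groups]

# Initial hidden-layer weights and biases drawn uniformly from (-1, 1).
rng = np.random.default_rng(0)
n_hidden = 4
params = rng.uniform(-1.0, 1.0, size=8 * n_hidden + n_hidden)
error, final_layer = output_error(params, inputs, targets, n_hidden)
print(error)
```

The first return value is the quantity a nonlinear optimiser would minimise over `params`; for example, it could be passed to `scipy.optimize.minimize` with `method="BFGS"`, or to a hand-written steepest descents loop.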

Note that the search strategy to find a minimum of the nonlinear function could have
been performed by a standard (accelerated) steepest descents procedure. In our
experience [2], this would have taken at least an order of magnitude longer in terms of CPU time,
or the number of iterations. It was decided that the BFGS procedure was one of the more
efficient techniques to use for this size of problem.